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WISHART DISTRIBUTIONS FOR DECOMPOSABLE COVARIANCE 

GRAPH MODELS 

By Kshitij Khare^ and Bala Rajaratnam^ 
University of Florida and Stanford University 

Gaussian covariance graph models encode marginal independence 
among the components of a multivariate random vector by means 
of a graph G. These models are distinctly different from the tradi- 
tional concentration graph models (often also referred to as Gaussian 
graphical models or covariance selection models) since the zeros in 
the parameter are now reflected in the covariance matrix E, as com- 
pared to the concentration matrix f2 = E~^. The parameter space of 
interest for covariance graph models is the cone Pq of positive defi- 
nite matrices with fixed zeros corresponding to the missing edges of 
G. As in Letac and Massam [Ann. Statist. 35 (2007) 1278-1323], we 
consider the case where G is decomposable. In this paper, we con- 
struct on the cone Pg a family of Wishart distributions which serve a 
similar purpose in the covariance graph setting as those constructed 
by Letac and Massam [Ann. Statist. 35 (2007) 1278-1323] and Dawid 
and Lauritzen [Ann. Statist. 21 (1993) 1272-1317] do in the concen- 
tration graph setting. We proceed to undertake a rigorous study of 
these "covariance" Wishart distributions and derive several deep and 
useful properties of this class. First, they form a rich conjugate fam- 
ily of priors with multiple shape parameters for covariance graph 
models. Second, we show how to sample from these distributions by 
using a block Gibbs sampling algorithm and prove convergence of 
this block Gibbs sampler. Development of this class of distributions 
enables Bayesian inference, which, in turn, allows for the estima- 
tion of E, even in the case when the sample size is less than the 
dimension of the data (i.e., when "n <p"), otherwise not generally 
possible in the maximum likelihood framework. Third, we prove that 
when G is a homogeneous graph, our covariance priors correspond 
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to standard conjugate priors for appropriate directed acyclic graph 
(DAG) models. This correspondence enables closed form expressions 
for normalizing constants and expected values, and also establishes 
hyper-Markov properties for our class of priors. We also note that 
when G is homogeneous, the family IW Qq of Letac and Massam 
[Ann. Statist. 35 (2007) 1278-1323] is a special case of our covariance 
Wishart distributions. Fourth, and finally, we illustrate the use of our 
family of conjugate priors on real and simulated data. 

1. Introduction. Due to recent advances in science and information tech- 
nology, there has been a huge influx of high-dimensional data from various 
fields such as genomics, environmental sciences, finance and the social sci- 
ences. Making sense of all the many complex relationships and multivariate 
dependencies present in the data, formulating correct models and devel- 
oping inferential procedures is one of the major challenges in modern day 
statistics. In parametric models, the covariance or correlation matrix (or its 
inverse) is the fundamental object that quantifies relationships between ran- 
dom variables. Estimating the covariance matrix in a sparse way is crucial in 
high-dimensional problems and enables the detection of the most important 
relationships. In this light, graphical models have served as tools to discover 
structure in high-dimensional data. 

The primary aim of this paper is to develop a new family of conjugate prior 
distributions for covariance graph models (a subclass of graphical models) 
and study the properties of this family of distributions. It is shown in this 
paper that these properties are highly attractive for Bayesian inference in 
high-dimensional settings. In covariance graph models, specific entries of the 
covariance matrix are restricted to be zero, which implies marginal indepen- 
dence in the Gaussian case. Covariance graph models correspond to curved 
exponential families and are distinctly different from the well-studied con- 
centration graph models, which, in turn, correspond to natural exponential 
families. 

A rich framework for Bayesian inference for natural exponential families 
has been established in the last three decades, starting with the seminal and 
celebrated work of Diaconis and Ylvisaker [10] that laid the foundations 
for constructing conjugate prior distributions for natural exponential family 
models. The Diaconis- Ylvisaker (henceforth referred to as "DY") conjugate 
priors are characterized by posterior linearity of the mean. An analogous 
framework for curved exponential families is not available in the literature. 

Concentration graph models (or covariance selection models) were one of 
the first graphical models to be formally introduced to the statistics commu- 
nity. These models reflect conditional independencies in multivariate prob- 
ability distributions by means of a graph. In the Gaussian case, they induce 
sparsity or zeros in the inverse covariance matrix and correspond to natu- 
ral exponential families. In their pioneering work, Dawid and Lauritzen [9] 
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developed the DY prior for this class of models. In particular, they intro- 
duced the hyper-inverse Wishart as the DY conjugate prior for concentration 
graph models. In a recent major contribution to this field, a rich family of 
conjugate priors that subsumes the DY class has been developed by Letac 
and Massam [20]. Both the hyper-inverse Wishart priors and the "Letac- 
Massam" priors have attractive properties which enable Bayesian inference, 
with the latter allowing multiple shape parameters and hence being suitable 
in high-dimensional settings. Bayesian procedures corresponding to these 
Letac-Massam priors have been derived in a decision theoretic framework 
in the recent work of Rajaratnam, Massam and Carvalho [26]. 

Consider an undirected'^ graph G with a finite set of vertices V (of size 
p) and a finite set E of edges between these vertices, that is, G = iV^E). 
The Gaussian covariance graph model corresponding to the graph G is the 
collection of p-variate Gaussian distributions with covariance matrix S such 
that Sjj = whenever ^ E. This class of models was first formally 
introduced by Cox and Wermuth [6, 7]. In the frequentist setting, max- 
imum likelihood estimation in covariance graph models has been a topic 
of interest in recent years. Many iterative methods that obtain the maxi- 
mum likelihood estimate have been proposed in the literature. The graphical 
modeling software MIM in Edwards [12] fits these models by using the "dual 
likelihood method" from Kauermann [17]. In Wermuth, Cox and Marchetti 
[31], the authors derive asymptotically efficient approximations to the maxi- 
mum likelihood estimate in covariance graph models for exponential families. 
Chaudhuri, Drton and Richardson [4] propose an iterative conditional fit- 
ting algorithm for maximum likelihood estimation in this class of models. 
Covariance graph models have also been used in applications in Butte et 
al. [3], Grzebyk, Wild and Chouaniere [15], Mao, Kschischang and Frey [21] 
and others. 

Although Gaussian covariance graph models are simple and intuitive to 
understand, no comprehensive theoretical framework for Bayesian inference 
for this class of models has been developed in the literature. In that sense, 
Bayesian inference for covariance graph models has been an open problem 
since the introduction of these models by Cox and Wermuth [6, 7] more than 
fifteen years ago. The main difficulty is that these models give rise to curved 
exponential families. The zero restrictions on the entries of the covariance 
matrix S translate into complicated restrictions on the corresponding entries 
of the natural parameter, VL = Hence, the sparseness in S does not 
translate into sparseness in and thus a covariance graph model cannot 
be viewed as a concentration graph model. No general theory is available 



^We shall use dotted edges for our graphs, in keeping with the notation in the literature; 
bi-directed edges have also been used for representing covariance graphs. 
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for Bayesian inference in curved exponential families for continuous random 
variables, akin to the Diaconis-Ylvisaker [10] or standard conjugate theory 
for natural exponential families. 

There are several desirable properties that one might want when con- 
structing a class of priors, but one of the foremost requirements is to be able 
to compute quantities such as the mean or mode of the posterior distribution, 
either in closed form or by sampling from the posterior distribution by a sim- 
ple mechanism. This is especially important in high-dimensional situations, 
where computations are complex and can become infeasible very quickly. 
Another desirable and related feature is conjugacy, that is, the class of pri- 
ors is such that the posterior distribution also belongs to this class. Among 
other things, this increases the prospects of obtaining closed form Bayes es- 
timators and can also add to the interpret ability of the hyper-parameters. 
The class of Wishart distributions developed by Letac and Massam [20] (and 
later used for flexible Bayesian inference for concentration graph models by 
Rajaratnam, Massam and Carvalho [26]), known as the IW family of dis- 
tributions, are not appropriate for the covariance graph setting. There is the 
additional option of using the IWq^^ class as priors for this situation. We, 
however, establish that the posterior distribution fails to belong to the same 
class and there are no known results for computing the posterior mean or 
mode, either in closed form or by sampling from the posterior distribution. 

A principal objective of this paper is to develop a framework for Bayesian 
inference for Gaussian covariance graph models. We proceed to construct 
a rich and flexible class of conjugate Wishart distributions, with multiple 
shape parameters, on the space of positive definite matrices with fixed ze- 
ros, that corresponds to a decomposable graph G. This class of distributions 
is specified up to a normalizing constant, and conditions under which this 
normalizing constant can be evaluated in closed form are derived. We ex- 
plore the distributional properties of our class of priors and, in particular, 
show that the parameter can be partitioned into blocks so that the condi- 
tional distribution of each block, given the others, is tractable. Based on this 
property, we propose a block Gibbs sampling algorithm to simulate from the 
posterior distribution. We proceed to formally prove the convergence of this 
block Gibbs sampler. Our priors yield proper inferential procedures, even in 
the case when the sample size n is less than the dimension p of the data, 
whereas maximum likelihood estimation is, in general, only possible when 
n>p (in fact, in the homogeneous case, it can be shown that the condition 
n > p is actually also necessary, thus highlighting the fact that results from 
the concentration graph setting do not carry over to the covariance model 
setting). We also show that our covariance Wishart distributions are, in the 
decomposable nonhomogeneous case, very different from the Letac-Massam 
priors Wp^ and IWq^ . However, when the underlying graph G is homoge- 
neous, the Letac-Massam IWq^^ priors are a special case of our distribu- 
tions. We establish, in the homogeneous setting, a correspondence between 
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the covariance priors in this paper and the natural conjugate priors for ap- 
propriate directed acychc graph (DAG) models. This correspondence helps 
us to explicitly evaluate quantities like the normalizing constant and the 
posterior mean of the covariance matrix in closed form. In this scenario, we 
also show that our class of priors satisfies the strong directed hyper-Markov 
property (as introduced in Dawid and Lauritzen [9] for concentration graph 
models). It should be pointed out that these aforementioned results for ho- 
mogeneous graphs can also be established directly, without exploiting the 
correspondence with the DAG models. The direct approach is self-contained, 
whereas the latter invokes an external result which states that for the re- 
strictive class of homogeneous graphs, covariance graph models and DAGs 
are Markov equivalent. 

We noted above that for concentration graph models or the traditional 
Gaussian graphical models, a rich theory has been established by Dawid and 
Lauritzen [9], who derive the single parameter DY conjugate prior for these 
models, and by Letac and Massam [20], who derive a larger flexible class with 
multiple shape parameters. In essence, this paper is the analog of the results 
in the two aforementioned papers in the covariance graph model setting, with 
parallel results, all of which are contained in a single comprehensive piece. 
Hence, this work completes the powerful theory that has been developed in 
the mathematical statistics literature for decomposable models. 

We also point out that a class of priors in the recent work [29] is a special 
case of our class of flexible covariance Wishart distributions.'^ Our family 
allows multiple shape parameters, as compared to a single shape parame- 
ter, and hence yields a richer class suitable to high-dimensional problems. 
Moreover, we show that their iterative algorithm to sample from the poste- 
rior is different from ours. Since the authors do not undertake a theoretical 
investigation of the convergence properties of their algorithm, it is not clear 
if it does indeed converge to the desired distribution. On the other hand, 
we proceed to formally prove that our algorithm converges to the desired 
distribution. The remaining sections of this paper are considerably different 
from [29] since we undertake a rigorous probabilistic analysis of our conju- 
gate Wishart distributions for covariance graph models, whereas they give 
a useful and novel treatment of latent variables and mixed graph models in 
a machine learning context. 

This paper is structured as follows. Section 2 introduces the required 
preliminaries and notation. In Section 3, the class of covariance Wishart 
distributions is formally constructed. Conjugacy to the class of covariance 



^This is in a similar spirit to the way in which the HIW prior of Dawid and Lauritzen 
[9] is a special case of the generalized family of Wishart distributions proposed by Letac 
and Massam [20] for the concentration graph setting. 
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graph models and sufficient conditions for integrability are established. Com- 
parison with the Letac-Massam IWq^ priors, which are not, in general, 
conjugate in the covariance graph setting, is also undertaken. In Section 
4, a block Gibbs sampler which enables sampling from the posterior dis- 
tribution is proposed and the corresponding conditional distributions are 
derived. Thereafter, a formal proof of convergence of this block Gibbs sam- 
pler is provided. In Section 5, we restrict ourselves to the case when G is a 
homogeneous graph. We examine the distributional properties of our class 
of priors in this section and prove that the covariance priors introduced in 
this paper correspond to natural conjugate priors for DAG models in the 
homogeneous setting. This correspondence helps in establishing closed form 
expressions for normalizing constants, expected values and hyper-Markov 
properties for our class of priors for G homogeneous. Finally, we illustrate 
the use of our family of conjugate priors and the methodology developed in 
this paper on a real example, as well as on simulated data. The Appendix 
contains the proofs of some of the results stated in the main text. 

2. Preliminaries. In this section, we give the necessary notation, back- 
ground and preliminaries that are needed in subsequent sections. 

2.1. Modified Cholesky decomposition. If S is a positive definite matrix, 
then there exists a unique decomposition 

(2.1) T, = LDL'^, 

where L is a lower-triangular matrix with diagonal entries equal to 1 and D 
a diagonal matrix with positive diagonal entries. This decomposition of S is 
referred to as the modified Cholesky decomposition of S (see [25]). We now 
provide a formula that explicitly computes the inverse of a lower-triangular 
matrix with I's on the diagonal, such as those that appear in (2.1). 

Proposition 1 . Let L be an m x m lower-triangular matrix with 1 's 
on the diagonal. Let 

m 

A = |J{t:tG {l,2,...,m}^ri <7i_i V2 < i < r} 

r=2 

and 

dim(T) 
i=2 

where dim(r) denotes the length of the vector r. Then, L^^ = N, where N 
is lower-triangular matrix with 1 's on the diagonal and, for i > j , 

dim(r) 

N^,= E (_l)dimM-l -Q 

rG^,ri=i,TdimT=i «=2 
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The proof is provided in the Appendix. 

An undirected graph G is a pair {V, E), where y is a permutation^ of the 
set {1,2,..., m} denoting the set of vertices of G. The set E <^V xV denotes 
the set of edges in the graph. If vertices u and v are such that {u,v) G E, 
then we say that there is an edge between u and v. It is also understood 
that {u,v) £ E imphes that {v,u) G E, that is, the edges are undirected. 
Although the dependence of G = {V, E) on the particular ordering in V is 
often suppressed, the reader should bear in mind that unlike traditional 
graphs, the graphs defined above are not equivalent up to permutation of 
the vertices® modulo the edge structure. Below, we describe two classes of 
graphs which play a central role in this paper. 

2.2. Decomposable graphs. An undirected graph G is said to be decom- 
posable if any induced subgraph does not contain a cycle of length greater 
than or equal to four. The reader is referred to Lauritzen [19] for all of 
the common notions of graphical models (and, in particular, decomposable 
graphs) that we will use here. One such important notion is that of a perfect 
order of the cliques. Every decomposable graph admits a perfect order of its 
cliques. Let (Ci, C2, . . . , Gk) be one such perfect order of the cliques of the 
graph G. The history for the graph is given by Hi = Gi and 

Hj = GiUG2U---UGj, j = 2,^,...,k, 

and the minimal separators of the graph are given by 

S,=Hj^inG,, j = 2,3,...,k. 

Let 

Rj = Gj \ Hj^i for j = 2,3,...,k. 

Let k' <k — 1 denote the number of distinct separators and 1^(5*) denote 
the multiplicity of S, that is, the number of j such that Sj = S. Generally, we 
will denote by C the set of cliques of a graph and by S its set of separators. 

Now, let S be an arbitrary positive definite matrix with zero restrictions 
according to G = {V,E),'^ that is, Sjj = whenever (i, j) ^ E. It is known 
that if G is decomposable, then there exists an ordering of the vertices such 
that if S = LDL^ is the modified Cholesky decomposition corresponding to 
this ordering, then, for i > j, 

(2.2) Lij = whenever (i, j) ^ E. 



^The ordering in V is emphasized here since the elements of V will later correspond to 
rows or columns of matrices. 

^This has been done for notational convenience, as will be seen later. 

''it is emphasized here that the ordering of the vertices reflected in V plays a crucial 
role in the definitions and results that follow. 
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Although the ordering is not unique in general, the existence of such an 
ordering characterizes decomposable graphs (see [24] ) . A constructive way to 
obtain such an ordering is given as follows. Label the vertices in descending 
order, starting with vertices in Ci, R2, R3, Rk, with vertices belonging to 
a particular set being ordered arbitrarily (see [19, 24, 30] for more details). 

2.3. The spaces Pq, Qq and Cq- An m-dimensional Gaussian covari- 
ance graph model*^ can be represented by the class of multivariate normal 
distributions with fixed zeros in the covariance parameter (i.e., marginal in- 
dependencies) described by a given graph G = {V,E). That is, if (i,j) ^ 
then the ith and jth components of the multivariate random vector are 
marginally independent. Without loss of generality, we can assume that 
these models have mean zero and are characterized by the parameter set 
Pg of positive definite covariance matrices S such that Sj^ = whenever 
the edge (i, j) is not in E. Following the notation in [20, 26] for G decom- 
posable, we define Qq to be the space on which the free elements of the 
precision matrices (or inverse covariance matrices) O live. 

More formally, let M denote the set of symmetric matrices of order m, 
C M the cone of positive definite matrices (abbreviated as "> 0"), Iq 
the linear space of symmetric incomplete matrices x with missing entries 
Xij, ^ E, and k:M Iq the projection of M into Iq- The parameter 
set of the precision matrices of Gaussian covariance graph models can also be 
described as the set of incomplete matrices Q = Av(S~^), S € Pq- The entries 
^ijj {hi) ^ are not free parameters of the precision matrix for Gaussian 
covariance graph models (see [20, 26] for details). We are therefore led to 
consider the two cones 



where Pq C Zq, Qg C Iq and Zq denotes the linear space of symmetric 
matrices with zero entries yij, ^ E. Furthermore Grone et al. [14] prove 
that for G decomposable, the spaces Pq and Qg are isomorphic (once more, 
see [20, 26] for details). 

We now introduce new spaces jCq and @g (the modified Cholesky space) 
that will be needed in our subsequent analysis^: 



Cg = {L : Lij = whenever i < j, or (i, j) ^ E, and La = 1, VI < i,j < m}; 
Sg = {0 = {L, D):LeCG,D diagonal with Da > \/l < i < m}. 



(2.3) 
(2.4) 



PG = {yeM+\y,,=0,{i,j)(^E}, 
Qg = {x e Ig\xc, >0,i = l,...,k} 



*A brief overview of the literature in this area is provided in the Introduction. 
^These spaces are not defined in [20, 26]. 
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We define the mapping ip : @g as follows: 

(2.5) Tp{L,D)=LDL'^. 

This mapping tp plays an important role in our analysis and shall be studied 
later. 

2.4. Homogeneous graphs. A graph G = {V, E) is defined to be homoge- 
neous if, for all G either 

{u:u= j or (n, j) G C {n : u = i or {u,i) £ E} 

or 

{u:u = i or {u,i) G E} C {u:u=j or {u,j) G E} . 
Equivalently, a graph G is said to be homogeneous if it is decomposable 

12 3 4 

and does not contain the graph • — • — • — denoted by A4, as an induced 
subgraph. Homogeneous graphs have an equivalent representation in terms 
of directed rooted trees, called Hasse diagrams. The reader is referred to [20] 
for a detailed account of the properties of homogeneous graphs. We write 
i — )• j whenever 

{u:u = j or {u,j) G E} {u : u = i or {u, i) £ E}. 

Denote by R the equivalence relation on V defined by 

iRj <^ i ^ j and j — t- i. 

Let i denote the equivalence class in V/R containing i. The Hasse diagram 
of G is defined as a directed graph with vertex set Vh = V/R = {i:i£ V} 
and edge set Eh consisting of directed edges with G Eh for j if the 
following holds: J — )• j and $k such that i ^ k ^ j , k^i, k ^ j. 

If G is a homogeneous graph, then the Hasse diagram described above is 
a directed rooted tree such that the number of children of a vertex is never 
equal to one. It was proven in [20] that there is a one-to-one correspondence 
between the set of homogeneous graphs and the set of directed rooted trees 
with vertices weighted by positive integers [w{i) = such that no vertex 
has exactly one child. Also, when iRj, we say that i and j are twins in the 
Hasse diagram of G. Figure 1 provides an example of a homogeneous graph 
with seven vertices and the corresponding Hasse diagram. 

The following proposition for homogeneous graphs plays an important 
role in our analysis. 

Proposition 2. If G is a homogeneous graph, then there exists an or- 
dering of the vertices, such that, for this ordering: 
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1. S G Pg <^ L £ Cg, where S = LDL^ is the modified Cholesky decompo- 
sition of S; 

2. LeCc^L-^eCc. 

The proof of this proposition is well known and so is omitted for the 
sake of brevity (see [1, 18, 27]). We now describe a procedure for ordering 
the vertices, under which Proposition 2 holds. Given a homogeneous graph 
G, we first construct the Hasse diagram for G. The vertices are labeled 
in descending order, starting from the root of the tree. If the equivalence 
class at any node has more than one element, then they are labeled in any 
order. Hereafter, we shall refer to this ordering scheme as the Hasse perfect 
vertex elimination scheme. For example, if we apply this ordering procedure 
to the graph in Figure 1, then the resulting labels are {a,b,c,d,e, f, g} — ^ 
{4,5,1,3,7,6,2}. 

2.5. Vertex ordering. Let G = {V, E) be an undirected decomposable 
graph with vertex set F = {1, 2, . . . , m} and edge set E. Let Sy denote the 
permutation group associated with V . For any a € Sy., let G^ := {a{y),Ea), 
where (n, -y) G E^ if and only if {u) , {v)) G E. Let Sd C Sy de- 
note the subset of permutations a oi V such that, for any S G with 
S = LDL^, L G £g<t 5] G -Pgct- Hence, for every a G Sd, the mapping 
i^a- '■ &Ga ~^ defined in (2.5) is a bijection from @g^ to Pg^- In partic- 
ular, the ordering corresponding to any perfect vertex elimination scheme 
lies in Sd (see Section 2.2). If G is homogeneous, let Sh ^ Sd denote the 
subset of permutations a oi V such that L G vCg^ ^ ^Ga- 1^ partic- 

ular, any ordering of the vertices corresponding to the Hasse perfect vertex 
elimination scheme lies in Sh (see Section 2.4). The above defines a nested 
triplet of permutations of V given by Sh ^ •S'z) C 5y. 



• d 



f • 



• e 



• b 



.1, a={a,d} g c 

(a) (b) 

Fig. 1. (a) An example of a homogeneous graph with 7 vertices; (b) the correspon 
Hasse diagram. 
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3. Wishart distributions for covariance graphs. Let G = {V, E) be an 

undirected decomposable graph with vertex set V and edge set E. We as- 
sume that the vertices in V are ordered so that V G So- The covariance 
graph model associated with G is the family of distributions 

g = {AA^(0,S):SGPG} 
^{Mm{O^LDL^):{L,D)&&G}. 
Consider the class of measures on 0^ with density [with respect to 

ni>i,{j,i)e£^-^u ni^i^Ai] 

(3.1) = e-(*'■«^^^^)"^)+^-^"»^°^^-)/^ e = {L,D) G 0G. 

These measures are parameterized by a positive definite matrix U and a 
vector a G M™ with nonnegative entries. Let us first establish some notation: 

. Nil) :={j:(z,j)GS}; 

. M^{i):={j:{i,j)eE,i>j]- 

• [/^« := {{Uki))k,i&M-<(i)i 

• U-' ■.= {{Uki))k,i(^M^{i)u{iy^ 

Let 

a) := I e-(tr((i^i")-^f^)+E™ 1 log ^0/2 

If zg{U,cx) < oo, then ttu^o. can be normalized to obtain a probability mea- 
sure. A sufficient condition for the existence of a normalizing constant for 
TTjj^ctiL, D) is provided in the following proposition. 

Theorem 1. Let dL:=Y[(^ij)(zEi>j(i^ij and dD := Y[^i dDu . Then, 

g-(tr((LDL^)-iC/)4-Er=i"a°gA.)/2^^^^ < ^ 

©G 

ai > |7^^(z)| +2 Vi = l,2,...,m. 

As the proof of this proposition is rather long and technical, it is de- 
ferred to the Appendix. The normalizing constant ZG{U,a) is not generally 
available in closed form. Let us consider a simple example to illustrate the 
difficulty of computing the normalizing constant explicitly. 

12 3 4 

Let G = A4, that is, the path on four vertices, or • — • — • — •. Note that 
this is a decomposable (but not homogeneous) graph. The restrictions on L 



12 



K. KHARE AND B. RAJARATNAM 



are L31 = L41 = L42 = 0. Let U € Pq and a = (4,4,4,4). Then, after inte- 
grating out the elements Da, 1 < i < 4 (recognizing them as inverse-gamma 
integrals) and transforming the entries of L to the independent entries of 
(as in the proof of Proposition 1), the normalizing constant reduces to 
an integral of the form 

/ {U22 + 2Ui2Xi + Uuxl)-^ 

Jm.3 

X {Unxlxl + U22XI + f/33 + 2C/i2Xix^ + 2[/i3XiX2 + 2C/23a:2)^^ 

X {Uiix\xlxl + 1/22X^x1 + U33XI + Uu + 2?7i2XiX2X3 

-I- 2C/i3XiX2x| -I- 2C/14X1X2X3 -I- 2[/23a;2a;i + 2C/24X2X3 -|- Us^x^)"^ dx. 

The above integral does not seem to be computable by standard techniques 
for general U. Despite this inherent difficulty, we propose a novel method 
which allows sampling from this rich family of distributions (see Section 4). 

We will show later that the condition in Theorem 1 is necessary and 
sufficient for the existence of a normalizing constant for homogeneous graphs. 
Moreover, in this case, the normalizing constant can be computed in closed 
form. We denote by 7ru,c( the normalized version of 7f(/^ct whenever zg{U, a) < 
00. The following lemma shows that the family tt^/.q is a conjugate family 
for Gaussian covariance graph models. 

Lemma 1. Let G = {V^E) he a decomposable graph, where vertices in 
V are ordered so that V ^ Sd- Let Yi, Y2, . . . , Y„ he an i.i.d. sample from 
A/m(0, LDL^), where {L,D) e &g Let S = ^EiLi^iYf denote the em- 
pirical covariance matrix. If the prior distribution on {L,D) is nu,a, then 
the posterior distribution of {L,D) is given by vr^ ~, where U = nS + U and 
a = {n + ai,n + a2, ■ ■ ■ ,n + am) ■ 

Proof. The likelihood of the data is given by 

/(yi,y2,...,y„ I L,i^) = -=i—e-(-«^^^")-Hn^))+™i°sPI)/2. 

(V 2vrj"™ 

Using TTu^ct as a prior for {L,D), the posterior distribution of {L,D) given 
the data (Yi, Y2, . . . , Y„) is 

7r[;,„(L,Z)| Yi,Y2,...,Y„) 

^ g-(tr((LDL^)-i(nS+C/))+E™i(n+«01ogAi)/2^ Q ^ 

Hence, the posterior distribution belongs to the same family as the prior, 
that is, 

vr(7,a(- I Yi,Y2,...,Y„) =vr^^5(-), 
where U = nS -\- U and a = {n -\- ai,n -\- a2, ■ ■ ■ ,n + am)- D 
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Remark. If we assume that the observations have unknown mean //, 
that is, Yi,Y2,...,Y„ are i.i.d. A/'(/i,S) with /xG]R'",SgPg, then 



is the minimal sufficient statistic for S. Here, nS has a Wishart distribution 
with parameter S and n — 1 degrees of freedom. Hence, if we assume a prior 
'^U,a for {L,D), then the posterior distribution is given by 



where U = nS + U and a = {n — \ + ai^n — 1 + 02, . . . , n — 1 + am)- 

Remark. Note that, as with the distributions in [20, 26], the func- 
tional form of the prior distribution depends on the ordering of the ver- 
tices specified — but this is not as restrictive as it first appears. In this sense, 
an ordering is essentially another "parameter" to be specified and thus can 
also be viewed as imposing extra information. We return to this point in 
the examples section where we investigate the impact of ordering on a real- 
world example (see Section 6). But, more importantly, given a perfect or- 
dering of the vertices, any rearrangement of the vertices within the residuals 
Rj = Cj\Hj^i will still preserve the zeros between S and L, and will thus 
be sufficient for our purposes. In this sense, the covariance Wishart distri- 
butions introduced in this paper do not actually depend on a full ordering 
of the vertices. In fact, for the class of decomposable graphs, any perfect 
ordering is sufficient, that is, any ordering that is used in [20] will also be 
relevant for the covariance Wishart distributions defined above. In this sense, 
these decomposable covariance Wishart distributions are very flexible, espe- 
cially since we are working in the curved exponential family setting and are 
still able to use any ordering that is appropriate for the [20] distributions 
which address the natural exponential family (NEF) concentration graph 
situation. The technical reason why any perfect ordering will suffice is that 
any perfect ordering will preserve the zeros between S and the matrix L 
from its Cholesky decomposition [24, 27]. Moreover, from an applications 
perspective, since matrix operations are not invariant with respect to order- 
ing of the nodes, an ordering that facilitates calculations is desirable. All 
that the ordering does is to relabel the vertices, but the edge structure is 
completely and fully retained. To further clarify what is meant, if one has 
a list of a genes/proteins called ABLIMl, BCL6, etc. and their names are 
replaced with the numbers 1, 2, 3, etc., the problem can first be analyzed 
with the integer labels and one can then go back to the original labels after 
the analysis is done. So, in many applications, the ordering is not a real 
restriction. 



^--E(Y^-Y)(Y.-Yf 



i=l 
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3.1. Induced prior on Pq and Qg- The prior iru^a on ©c (the modified 
Cholesky space) induces a prior on Pq (the covariance matrix space) and Qg ■ 
We provide an expression for the induced priors on these spaces in order to 
compare our Wishart distributions with other classes of distributions. Note 
that since the vertices have been ordered so that V G Sd, the transformation 

defined by 

V'(L,Z))=LDL^=:S 

is a bijection from &g to Pg- The lemma below provides the required Jaco- 
bians for deriving the induced priors on Pg and Qg- The reader is referred 
to Section 2.2 for notation on decomposable graphs. Note that if x is a ma- 
trix, then |x| denotes its determinant, while if C is a set, then \C\ denotes 
its cardinality. 

Lemma 2 (Jacobians of transformations). 

1. The Jacobian of the transformation ip: {L,D) — )• S from @g to Pg is 

m 
i=l 

Here, Djjijl) denotes that Djj is a function of S, and nj := \{i: € 
E,i >j}\ for j = 1,2,..., m. 

2. The absolute value of the Jacobian of the bijection C:x^ from Qg 
to Pg is 

cgc ses 

Proof. The first part is a direct consequence of a result in [28] and the 
proof of the second part can be found in [27]. □ 

These Jacobians allow us to compute the induced priors on Pg and Qg- 
The induced prior corresponding to TTjj^a on Pg is given by 

(3.2) ^^%{^) oc e-(f-(2-'^)+i:r=i(2n.+".)logi3,.(S))/2^ ^ ^ 

We first note that the traditional inverse Wishart distribution (see [23] ) with 
parameters U and n is a special case of (3.2) when G is the complete graph 
and ai = n — 2m + 2i,\/l < i < m. We also note that the ^-inverse Wishart 
priors introduced in [29] have a one-dimensional shape parameter 5 and are 
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a very special case of our richer class tt^*^. The single shape parameter 5 is 
given by the relationship Oj + 2nj = 6 + 2m, 1 <i < m}^ 

We now proceed to derive the induced prior on Qg- Let x = denote 
the image of S in Qg and let x denote (see [20, 26] for more details). 
Using the second part of Lemma 2, the induced prior corresponding to 'Ku,cx 
on Qg is given by 

n^G^{x) oc e-(t^(^^)+EIli(2n,+a,)logD,,((x)-l))/2 

3.2. Comparison with the Letac-Massam priors. We now carefully com- 
pare our class of priors to those proposed in Letac and Massam [20]. In 
[20], the authors construct two classes of distributions, named Wp^ and 
Wq(j , on the spaces Pg and Qg , respectively, for G decomposable (see [20] , 
Section 3.1). These distributions are generahzations of the Wishart distri- 
bution on these convex cones and have been found to be very useful for 
high-dimensional Bayesian inference, as illustrated in [26]. These priors lead 
to corresponding classes of inverse Wishart distributions IWp^ (on Qg) and 
/VFqq (on Pg), that is, C/ ~ IW whenever tj~^ ~ ^Pgi F ~ IWq^. 
whenever k,(V~^) ~ Wqq. In [20], it is shown that the family of distribu- 
tions IWp(^ yields a family of conjugate priors in the Gaussian concentration 
graph setting, that is, when S S Qg- 

As the IWq^^ priors of [20] are defined on the space Pg, in principle, 
they can potentially serve as priors^^ in the covariance graph setting since 
the parameter of interest S lives in Pg- Let us examine this class more 
carefully, first with a view to understanding their use in the covariance graph 
setting and second to compare them to our priors. Following the notation 
for decomposable graphs in Section 2.2 and in [20], the density of the IWq^. 
distribution is given by 

rr \(y-^\^A(^{c)+{c+i)/2 



There is an interesting parallel here that becomes apparent from our derivations 
above. In the concentration graph setting, the single shape parameter hyper-inverse 
Wishart (HIW) prior of Dawid and Lauritzen [9] is a special case of the multiple 
shape parameter class of priors introduced by Letac and Massam [20], in the sense that 
Qi = — ^(5 + Ci — 1) (see [26] for notation). In a similar spirit, we discover that the single 
shape parameter class of priors in [29] is a special case of the multiple shape parameter 
class of priors Try, a introduced in this paper, in the sense that ai = 5 — 2ni + 2m. 

^^The use of this class of nonconjugate priors for Bayesian inference in covariance graph 
models was already explored in [22]. 
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where U G Pc, and a{C), C € C and f3{S), S £ S are real numbers. The 
posterior density of S under this prior is given by 

IW /V I V V V W ^.-tr(S-i(C/+n5))/2 



Y\g^^\{j:-i)s\KSmS)+{^+l)/2)+nu(S)/2 ■ 



However, U + nS may not, in general, be in Pq, which is a crucial assumption 
in the analysis in [20]. Hence, the conjugacy breaks down. 

We now investigate similarities and differences between our class of priors 
and the IW class. Since the IWq^'^ density is defined only for U G Pg, 
a pertinent question is whether our class of priors has the same functional 
form when U G Pq- We discover that this is not the case and demonstrate 
this through an example. Consider the 4-chain A4. One can easily verify that 
the terms e"*''^^ ^^/^ are identical in both priors. We now show that the 
remaining terms are not identical. If S = LDL^ is the modified Cholesky 
decomposition of E, then, for this particular graph with Ci = {1,2}, C2 = 
{2,3}, C3 = {3,4} and ^2 = {3}, ^3 = {4}, the expression that is not in the 
exponential term for the IW density is of the form 



-2 
^43 



D22) Daa) V -^33-^44 

This expression is clearly different from the term, other than the expo- 
nent e~*''(^ ^^/"^ in vr^'^, which is a product of different powers of Dii,i = 
1,2,3,4. 

However, an interesting property emerges when G is homogeneous. Note 
that, in this case, for any clique C and any separator S", 

i&C iG5 

Hence, when G is homogeneous, the class IW is contained in the class 
vr^'^. The containment is strict because U need not be in Pq for our class 
vr^'^. Also, in IWq^, the exponent of Da and Djj is the same if iRj, that 
is, the shape parameter, is shared for vertices in the same equivalence class, 
as defined by the relation R. We, however, note that the difference in the 
number of shape parameters is not a major difference, due to the result 
of Consonni and Veronese [5], together with fact that for the Wq^ (and, 
correspondingly, for the IWq^), each one of the blocks has a Wishart 
distribution (see Theorem 4.5 of [20]). 



WISHART DISTRIBUTIONS 



17 



We therefore note that in the restrictive case when G is homogeneous and 
when U G Pq, the two classes of distributions tt^^ and IWq^. have the same 
functional form. The fact that we do not restrict U G Pq is an important 
difference since, even in the homogeneous case, they yield a larger class of 
distributions on the homogeneous cone Pq compared to those in Andersson 
and Wojnar [2], resulting in nonsuperficial consequences for inference in 
covariance graph models. 

4. Sampling from the posterior distribution. In this section, we study 
the properties of our family of distributions and thereby provide a method 
that allows us to generate samples from the posterior distribution corre- 
sponding to the priors defined in Section 3. In particular, we prove that 
= (L, D) G &G can be partitioned into blocks so that the conditional dis- 
tribution of each block given the others is a standard distribution in statistics 
and hence easy to sample from. We can therefore generate samples from the 
posterior distribution by using the block Gibbs sampling algorithm. 

4.1. Distributional properties and the block Gibbs sampler. Let us intro- 
duce some notation before deriving the required conditional distributions. 
Let G = {V,E) be a decomposable graph such that V G Sd- For a lower- 
triangular matrix L with diagonal entries equal to 1, 

Lu. := uth row of L, u = l,2,...,m, 

L.y := vih column of L, u = 1, 2, . . . , m, 

^% iI^uv)u>v,{u,v)eEj f = 1, 2, . . . , m — 1. 

So, is the vth. column of L without the components which are specified 
to be zero under the model G (and without the vth. diagonal entry, which is 
1). In terms of this notation, the parameter space can be represented as 

©G = {{L'^, L%, L%, L%^_i,D) : 

(4.1) 

Lij G M,V1 < j <i < m,{ij) G E,Dii > 0,V1 <i<m}. 

Suppose that 6 ~ Tru,a for some positive definite U and a G M™ with nonneg- 
ative entries. The posterior distribution is then vr^ ~, where U = nS + U, a = 
{n + ai,n + 02, ■ ■ ■ ,n + Om)- In the following proposition, we derive the dis- 
tributional properties which provide the essential ingredients for our block 
Gibbs sampling procedure. 



^^We note, however, that the distributions in [2] are quite general since the authors 
consider other homogeneous cones and not just Pq- 
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Theorem 2. Using the notation above, the conditional distributions of 
each component of [as in (4-1)] given the other components and the data 
Yi , Y2, . . . , Y„ are as follows: 
1. 

L% I {L\L%,D, Yi, Y2, . . . , Y„) ~ AA(/2^'^, M^'^) Vt; = 1, 2, . . . , m- 1, 

where 

i^^=iA.+ E E m:;?(l-1[7(l^)-^u 

or w<Cv ,Ly^ —0 

Vu > V, (u, v) £ E, 



— such that L,,,} = 0, 

(L-i[/(L^)-i^ 



Jvv 



2. 



(M^''^)-„i, ■.= iL-'UiL'^)-'U{LDL^)~^, yu,u'>v,{u,v),{u',v)eE; 



A. I L, Yi, Y2, ■ ■ ■ , Y^ ^ IG ('^ - 1, (^''^(f )''^ 



2 

independently for i = 1,2, . . . ,m, where "IG " represents the inverse-gamma 
distribution. 

Remark. The notation w : = in the definition of /x^'*^ above 
means indices w for which is as a function of entries of L. 

Deriving the required conditional distributions in Theorem 2 entails care- 
ful analysis. We first state two lemmas which are essential for deriving these 
distributions. 

Lemmas. Letu>v, {u,v)£E. Then, 
dL~^ 1 1 

Proof. The proof is straightforward and is therefore omitted for brevity. 

□ 

Recall from Proposition 1 that functionally depends on L^v only if 
i > u > V > j . We use this observation repeatedly in our arguments. For 
a given v, to prove conditional multivariate normality of the conditional 
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distribution of L'^, given the others, we shall demonstrate that if we treat D 
and the other columns of L as constants, then iT{{LD L^)~^U) is a quadratic 
form in the entries of L'^ . 

Lemma 4. Letu,u'>v, {u,v),{u' ,v) £ E. Then, 

tr{{LDL^)-'U) = 2{L-'U{L^)-'U{LDL^)-^,, 

OLiuv OLiu'v 

which is functionally independent of the elements of L^. 

Proof. First, note that, 
d 



■t^{{LDL^)-^U) 



<t=l ]=l k=l 



m m m / T ~^ 1 — ^ T ~^ j_ ] — ^ T 1 — ^' 
Sr^Sr^Sr^l ku-^vi-^kj ~^ ^ki ^ku^vj \ff /it o\ 

Z^Z^Z^[ -Jjr, ] Lemma 3) 



i=i j=i k=i ^ 

m m m T T T 

i=l j=l k=l '^^ 
Note that is a lower-triangular matrix. Hence, 
52 



■iv{{LDL^)-^U) 



r^ / m m m T T T 
_ ^ O I \ - \ ^ \ ^ ^ku^vi ^kj ^ 

~~ dL7v[^^,t", D^k ' 

\t=l j=l k=l 



2^"^^"^^"^ ku-'^vi ^ku'-^vj ^ 
i=l j=l k=l ^^'^ 

/mm \ / m r~l r~l 

ku ku' 



= 2{L-'u{L^r\,{LDL^);;l,. 

The second equality above follows by noting that by Proposition 1, L^^ 
is functionally independent of L*^ for all 1 <i <m and L'^^ is functionally 
independent of for all 1 < < m and u> v, and then applying Lemma 3. 
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Using this functional^independence argument above once more, we thereby 
conclude that 2{L~^U{L^Y^)yy{LDL^)~l, is independent of L%. □ 

Proof of Theorem 2. An immediate consequence of Lemma 4 and 
the preceding remark is that we can write tr{{LDL^)~^U) as follows: 

tii{LDL^)-^U) 

u>v,{u,v)^E u'>v,{u' ,v)&E 

X {Luv — bu){Lu/i, — bu') + C, 

where b= {bu)u>v,{u,v)£E and C are independent of L^. In order to eval- 
uate ibu)u>v,{u,v)eE^ note that the term in tT{{LDL^)^^U) which is 
independent of is given by 

(4.2) -2 Yl {{L~'U{L^rX.{LDL^)ul')K' 

u'>v,(u' ,v)eE 

for every u> v, {u,v) G E. However, from the proof of Lemma 4, we alter- 
natively know that 

rs m m m r— Ir— Ir— 1 



dLuv ^ ,^ Dkk 

t=l j=l k=l 

Note that by Lemma 3, L^^L^j is functionally dependent on if and 
only if 7^ and L~j^ 7^ (as a function of L). Hence, the term in 
tr{{LDL'^)~^U) which is independent of is given by 

m m m r^lr^lr^l 

-^^"Mi„7=0 or L-=0} 



j=l j=l k=l 



ku k] 



(4.3) =-2 E^s*« E , 

= -2 E {L-'U),,{LDL^)Zj 

J- vj 

Now, observe the following facts: 
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1. the expressions in (4.2) and (4.3) should be the same for every u > v, 

2. if ^ = ^ ^ ^ ^ ^^^^ g^pj^ ^^j^^ 

then 

If we choose A, ^ and t] as 

A^^, := {L~^U{L^)-\.,{LDL^)-l, Vtx, u' such that L^J, = 0, 

'■= ^— such that L.tJ = 0, 

(L-iC/(L^)-i)„„ 

riu'-=bu yu> v,{u,v) £ E, 

then combining the observations above with (4.2) and (4.3), we obtain 

ti {{LDL'^y^U) 

u>v,{u,v)gE u'>v,{u' ,v)£E 

X (L™-^^'«)(L„,,-/i:;',^) + C. 

As defined ear her, 

(i-'f/).« 



^ — ^ ''™ Vu such that L^J = 0, 



(L-iC/(L^)-i),, 

/^::'''=^::+ E E m:;?(l-1[7(l^)-1)„,(lz)l^)-i^/.:; 

or ii;< Liji/^^ —0 

Vn > t;, (n, v) G E 

and C is independent of L'^ . It follows that under vr^ ~ , the conditional 

distribution of given the other parameters and the data Yi, Y2, . . . , Y„ 
is AA(/x^'^,M^'^). 

For deriving the conditional distribution of the entries of L>, we note that 



-(tr((Li?L^)-iC/)+E™i5«logD«)/2 ^ TT ^ ~(L-^U(LT)-^),,/(2D,-) 

The above leads us to conclude that the conditional distribution of Djj given 

the other parameters and the data Yi, Y2, . . . , Y„ is IG(^ — 1, — — — ~] 
independently for every j = l,2,...,m. □ 
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4.2. Convergence of block Gibbs sampler. The block Gibbs sampling pro- 
cedure, based on the conditional distributions derived above, gives rise to a 
Markov chain. It is natural to ask whether this Markov chain converges to 
the desired distribution vr^ ~ . Convergence properties are sometimes over- 
looked due to the theoretical demands in establishing them. However, they 
yield theoretical safeguards that the block Gibbs sampling algorithm can be 
used for sampling from the posterior distribution. 

We now prove that sufficient conditions for convergence of a Gibbs sam- 
pling Markov chain to its stationary distribution (see [2], Theorem 6) are 
satisfied by the Markov chain corresponding to our block Gibbs sampler. 
Let (f){x I /X, S) denote the AA(/x, S) density evaluated at x. Let ficid \ a, A) 
denote the IG(q;, A) density evaluated at d. Let us fix '0, di, ^2 > arbitrarily. 
Let 

@^,di42 ■■= = {L.D) e @G ■■ \Lij\ <i;,di< Du < d2 Vi > j, e E}. 

We now formally prove the conditions which are sufficient for establishing 
convergence. 

Proposition 3. There exists some 6 > such that, uniformly for all 
= {L,D) E &tp,di,d2' 

^{L%, I /x^''^, M'''^) >8 Vt; = 1, 2, . . . , m - 1, 
JlGlAily-l, ^ \>0 Vz= l,2,...,m. 

Proof. First, by Proposition 1, all entries of L"^ are polynomials in 
the entries of L. Since ©^,^1,^2 bounded and closed, there exists -01 > 
such that 

{L,D) G ©^,di,d2 ^ \LZl\ < ^1 Vn > {u,v) G E. 

Using the above, there exists a constant V'2 > such that if (L, D) G ©^,^1,^2 > 
then 

(4.4) |(L"^C/)™|<V2, \{LDL^)~l,\<^l,2. \{L-^U{L^)-\,\<^^2 

for every 1 < v,u,u' < m. Second, since = 1 for all l<v<m and U is 
positive definite, it follows that there exists a constant ip^ > such that if 
{L,D) e ©^,di,d2' then 

(4.5) V3 < (L-if/(L^)-i)„„ 
for every 1 <v <m. 
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Let {L,D) £ &i(,^di,d2- Note that 

Observe that if C = G and S = (^^21 S22) ^ positive definite 

matrix, then 

(Ci + 5]r/Si2C2)^Sn(Ci + Sr/SiaCa) < C^^C VC G M'". 
If we choose ^ and S as 

:= (L~^t7(L^)-i)™(LDL^)-„i, Vn, u' such that = 0, 

Cu = Vu such that = 0, 

then combining the observation above and the definition of /i^'*^, we get that 

u:L~u=Ou':L~''^,=0 

vu' 

From the definitions in Theorem 2, we also have 

{{M^''')-'fi^'^)u= {L-^U),,{LDL^)zj yu>v,{u,v)eE. 

It follows by (4.4) that for {L,D) £ @ti,,di,d2^ there exists ip^yO such that 

(4.6) (L^ - /i-'^)^(M-'«)-i(L^ - t,-'^) < 

for every u = 1, 2, . . . , m — 1. Also, by the definition of M^'^ , it follows that 
for {L,D) G 0^,(ii,d2)O < l-ZV/^'*^! < 00 and IM'^''^! is a continuous function of 
{L,D). Recall that for a matrix A, \ A\ denotes the determinant of A. Since 
®^,di,d2 is a bounded and closed set, both the maximum and minimum of the 
function IM"'*^! are attained in &^^di,d2 - follows that for {L,D) G 0»/),di,d2) 
there exist < ki < K2 such that 

(4.7) Kl < IM^'^I < K2 

for every = 1, 2, . . . , m - 1. It follows by (4.4), (4.5), (4.6) and (4.7) that 
for {L,D) S 0,/,,di,d2' there exists 61 > such that 

(l){L^^ I fi"'^, M^'^) >6i Vt; = 1, 2, . . . , m - 1. 

Note, furthermore, that if {L,D) £ ©^,^1,^2' then, from (4.4) and (4.5), 
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for every 1 <i <m. Hence, there exists 62 > such that 

/iG(Ai|y-l, 2 l>02 V2 = l,2,...,m. 

Let 6 = min((5i, 52)- Hence, for 6 = (L, D) E 0^,di,d2) 

(/>(L^ I //^'^, M^'^) >S Vz; = 1, 2, . . . , m - 1, 



(L-iC/(L^)-i),, 



/lG(^Ai|y-l, ^ -j>5 Vi = l,2,...,m. ^ 

Recah that = > v,{u,v) £ E}\. Note that the measures cor- 

responding to M{ijl"''^,M'"'^) and A/'(0,/„„) are mutuahy absolutely con- 
tinuous and the corresponding densities with respect to Lebesgue mea- 
sure are positive everywhere on W^^ for all = 1, 2, . . . , m — 1. In addi- 
tion, the measures corresponding to IG(^ — 1, — — ^ — ^—) and IG(^ — 
1,1) are mutually absolutely continuous and the corresponding densities 
with respect to Lebesgue measure are positive everywhere on (0, 00) for 
all i = 1, 2, . . . , m. Also, since @^^di,d2 bounded and closed, HI^^i^ I 
0) ni^i /ig(^ — 1) 1) is bounded on 0^,di,d2 • Combining this with Propo- 
sition 3, all required conditions in [2], Theorem 6 are satisfied. Hence, the 
block Gibbs sampling Markov chain, based on the derived conditional dis- 
tributions, converges to the desired stationary distribution T:u,a- 

We note that in [29], page 18, the authors introduce a procedure to sam- 
ple from the ^-inverse Wishart distributions (these are a narrow subclass 
of our priors vr^'^)- Essentially, at every iteration, they cycle through all 
of the rows of S. At the ith step in an iteration, they sample the vec- 
tor Tif, := (Sjj)jg^(j) from its conditional distribution (Gaussian) given the 
other entries of S and then sample 7^ := from its conditional distribu- 

tion (inverse- gamma) given the other entries of S. Since S is a symmetric 
matrix, for (i,j) E E, the variable appears in as well as S?'. Hence, 
((Sj^,7i), (S^,72), . . . , (S^.,7m)) is not a disjoint partition of the variable 
space. Therefore, their procedure is not strictly a Gibbs sampling procedure 
and its convergence properties are not clear. On the other hand, in our pro- 
cedure, we cycle through (L*^, L^, . . . , L*^, D), which is a disjoint partition 
of the variable space. Hence, our procedure is a Gibbs sampler in the true 
sense. There are also other differences between the two procedures, such as 
the fact that 7^ 7^ Da unless i = m. 



Remark. It is useful to compare our covariance priors to the condition- 
ally conjugate priors introduced by [8] in the complete case. Upon closer 
investigation we discover that the priors of [8] are quite different from 
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■Kjj^ctiL, D). First, they do not consider structural zeros. More importantly, 
under their posterior, the distribution of conditional on D is jointly 
multivariate normal. In the general decomposable covariance graph setting 
however, the zeros of L do not carry over to L~^, and so it is not possible 
in our framework for the distribution of the (constrained or unconstrained) 
elements of L~^, conditional on D, to be jointly multivariate normal. 

5. The special case of homogeneous graphs: Closed form expressions. 

Note that the covariance graph model, that is, the family of distributions 

g = {Mrn{0,^):^ePG} 

(supported on M™) is a curved exponential family for any connected non- 
complete graph G. As discussed earlier, the fact that the family is curved 
renders the Diaconis-Ylvisaker framework no longer applicable in this set- 
ting. Hence, a rich and flexible class of distributions was introduced in order 
to serve as priors for the class of covariance graph models. A natural question 
to ask is whether the class of priors itself belongs to a curved exponential 
family. Indeed, this class of priors is interesting in its own right and war- 
rants an independent investigation. Such analysis has the potential to place 
the class of priors in a known framework and thus exploit this property. 

Let us therefore now turn our attention to the class of priors {^c^'ct}c/eAf+ 
as a family of distributions supported on Pq, with [/ as a parameter. We 
now state a lemma which formally establishes that the class of priors can be 
framed in the context of natural exponential families. 

Lemma 5. For arbitrarily fixed a, the family of distributions {^ij^ct}ueM^ 
is a general exponential family, that is, it can be transformed into a natu- 
ral exponential family. The natural parameter is U = {{Uij))i<i<j<m, the 
corresponding set of sufficient statistics is = {{^^j^))i<i<j<m and the 
cumulant generating function is logZG{U,a). 

Proof. The proof is straightforward and is therefore omitted. □ 

Placing the class of covariance priors in a natural exponential family 
framework yields insights into the structure and functional form of this class 
of distributions. As noted earlier, zg{U, a) is not generally available in closed 
form. A question that naturally arises is whether there are any conditions 
under which ZG{U,a) can be evaluated in closed form. In this section, we 
establish that when G is homogeneous, ZGiU,a) and E^/^q, can be evaluated 



^ Note: not the class of distributions associated with the covariance graph probability 
model but rather the class of priors that is introduced in this paper. 
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in closed form. It is known that when G is a homogeneous graph, the covari- 
ance graph model is Markov equivalent to an appropriate DAG (see [11]). 
It is, however, important to clarify that the Markov equivalence of covari- 
ance graph models and DAGs does not immediately imply that Bayesian 
inference for covariance graph models using our priors automatically fol- 
lows. We also need to establish a correspondence between our priors and 
known priors for DAG models. We now prove that in the special case when 
G is homogeneous, our priors correspond to the standard conjugate priors 
for an appropriate DAG. This yields yet another property of our class of 
priors. The following theorem is the main result of this section and helps us 
establish the aforementioned correspondence. 

Theorem 3. Let G = {V,E) he homogeneous, with vertices ordered ac- 
cording to the Hasse perfect vertex elimination scheme specified in Section 
2.4, that is, V G Sh- If ^ ^u%l ^'^^ ^ ~ LDL^ is its modified Cholesky 
decomposition, then 

l<i<m 

are mutually independent. Furthermore, the distributions of these quantities 
are specified as follows: 

22' 2 J 

Vi = 1, 2, . . . , m. 

Remark. The above result decomposes T, into mutually independent 
coordinates. Note that for any i such that z is a leaf of the Hasse tree and i 
has the minimal label in its equivalence class i, we have 

In this case, it is understood that S^* and are vacuous parameters and 

Proof of Theorem 3. Let G be a homogeneous graph with m vertices, 
with the vertices ordered according to the Hasse perfect elimination scheme 
specified in Section 2.4. Recall that the vertices of the Hasse diagram of G 
are equivalence classes formed by the relation R defined in Section 2.4. The 
vertex labeled m clearly lies in the equivalence class of vertices at the root of 
the corresponding Hasse diagram. Let us remove the vertex labeled m from 
the graph G and let G' denote the induced graph on the remaining m — 1 
vertices. The graph G' can be of the following two types. 
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Case I: If the equivalence class of m contains more than one element, then 
G' is a homogeneous graph with the Hasse diagram having the same depth 
as the Hasse diagram of G, but with one less vertex in the equivalence 
class at the root. Recall that the depth of a tree is the length of the longest 
path from its root to any leaf. 

Case II: If the equivalence class of m contains only one element, then G' is 
a disconnected graph, with the connected components being homogeneous 
graphs with the Hasse diagram having depth one less than the depth of 
the Hasse diagram of G. 

Note that for every 1 <i <m such that J\f^ (i) 7^ (/), can be partitioned 



as 



^1 



Also, note that if Z'^AA(0,S), then Da is the conditional variance of Zi 
given Zi, Z2, . . . , Zi-i (see [16]). Note that Sfc^ = for all 1 < k,l < i,k £ 
Af^{i),l ^Af^{i). It follows that Du = J^u - Hence, by 

the formula for the inverse of a partitioned matrix, it follows that 



-<i\-l 



1 



Hence, 

-tr((S^*)-^ 



U 



■<i\ 



(5.1) 



1 



+ 



{Uu-mf{u^T'u^) 



Dr 



We again note that from our argument at the beginning of the proof, S^* = 
i^a,s a block diagonal structure (after an appropriate permu- 
tation of the rows and columns) with blocks S-*^, S-*^, . . . , S-** for some 
k > 1,1 < ii,i2, ■ . . ,ik < i-^^ It follows that 

k 



tr((S^T^f^^') 



^tr((S 



:<ii 



If the equivalence class of i has k children in the Hasse diagram of G and Vj is the 
set of vertices in V belonging to the subtree rooted at the jth child, then Vj, for 1 < j < fc, 
are disjoint subsets. In fact, if ij = max{i' : i' £ Vj}, then it follows by the construction of 
the Hasse diagram that Vj =Af-{ij) for 1 < j < fc. 
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Note that S = Using (5.1) recursively, we get 

tr(S-^C/) 



1=1 

Let us now evaluate the Jacobian of the transformation 

It follows by simple matrix manipulations that the Jacobian of the transfor- 
mation 

is given by . 

Once more, note that S = S-™ and, as mentioned earlier, S^* = 
or S^' (after an appropriate permutation of the rows and columns) has a 
block diagonal structure with blocks E-*^ , S-*^ , ■ ■ ■ , Tj-^^ for some /c > 1, 1 < 
ii,i2, . . . ,ik <i- Hence, by regarding the transformation 

l<i<m 

as a series of transformations of the type — )■ (S^', (S^*)~^S^, L'jj), it 
follows that the determinant of the Jacobian is given by 

mm m 
i=l i=lj(zX<{^i) j = l 

Here, as in Section 3.1 Lemma 2, 

nj = > j ■■ G E}\ Vj = 1, 2, . . . ,m. 
Also, from Section 3.1, 

Let 

r = {(Ai,(s^^)-^s:^)}i<i<^. 

It follows from the decomposition of tr(S~^J7) from (5.2) and the computa- 
tion of the determinant of the Jacobian (5.3) that 



7r^,„({(A.,(S^*)-^S:^)i<,<„}) 



^ m 
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-i/(2D,,)((s^0-'s:^-(c/-<^)-ic/.7=)^f/^^((s^')-is:^-({/^0-'c^. 



i=l 

m 

X 

i=l 



-Q 1/(2 A, ) (C/,. - (C/.^ r - ^ C/.^ ) -"i /2 _ 



The above proves the mutual independence of {(Z^jj, (S"^*) S;^)}i<,j<m- 
From the joint density of {Da, it is clear that 

A^~AA((C/^*)-lf/.^,A^(f/^*)-'). 

To evaluate the marginal density of Da, we integrate out from 
the joint density of (Ai, Note that 

g-i/{2A0({s^')-'s^-(c/^')-^[^.^rf/^U(s^')-^s^-(c/^')-ic/.t)^(^5.^i)-i5.- 

where C is a constant, since the above integral is essentially an unnormalized 
multivariate normal integral. Hence, the marginal density of Da is given by 

We can therefore conclude that 

" 2 2 2 y □ 

Remark. At first glance, it seems as if the only part of U that appears 
in Theorem 3 is {Uij)(^i^j)(zE-, that is, the projection of U onto Iq- Hence, 
one could incorrectly conclude that up to the number of shape parameters 
in each equivalence class, in the homogeneous case, the priors introduced 
in this paper and the IWq^ are identical. However, a careful inspection 
shows that this is not the case. Note that the conditional covariance of 
is Dii{U""'')~^ , and C/^* can contain entries of the form Uki such 

13 2 

that {k, I) ^ E. For example, suppose that G =• — • — •. Then, 

TT^3 ^ ( Uii Ui2 

but (1, 2) ^ E. Hence, in the homogeneous setting, vr^'^ is truly a larger class 
than the IWq^^ family of distributions. 

We now establish the correspondence between irfj^ in (5.4) and the con- 
jugate prior for an appropriate Gaussian DAG model. Let G = (V, E) be a 
homogeneous graph with V € Sh, that is, the vertices have been ordered 
according to the perfect vertex elimination scheme for homogeneous graphs 
outlined in Section 2.4. Let us construct a DAG as follows: 
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Fig. 2. Example of homogeneous covariance graph (left); associated DAG model (right). 

1. Consider the Hasse diagram of G (for simplicity and clarity of exposition, 
assume that the equivalence class at each vertex has just one element). 

2. Assign a directed edge from n to f if u is a descendant of v in the Hasse 
tree, that is, reverse the directions of all the arrows, including those that 
do not appear in the Hasse tree, but which are implied by transitivity. 

An example of a DAG constructed in this manner is given in Figure 2. 

Now, let pa{i) denote the set of parents of i according to the direction 
specified above. If Y ~ A/'m(0,S) has a distribution which is Markov with 
respect to the above DAG, then the density of Y factorizes as 



f{y) = Y[f{yi I ypa(i)) 



i=l 
m 



where Ai := - (EJ)^(S^')-'Si = 

The standard conjugate prior for each factor of the product above can 
be obtained as follows. Given an arbitrary positive definite matrix U and 
> for i = 1, 2, . . . , m, let 

, Uu-{u:<f{u^T'ut 



A,;~IG a: 



for z = l,2,...,m, where {(Djj, (S^*)~^S;^)}i<j<m are mutually indepen- 
dent. This corresponds precisely to the tt^^^ density in (5.4). 

We now proceed to state, without proof, results for homogeneous co- 
variance graph models by exploiting the correspondence of our priors to the 
standard conjugate priors for DAGs. In particular, hyper-Markov properties, 
the normalizing constant and expected values for covariance graph models, 
for G homogeneous, are formally stated below. 
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Let G = (y, E) be a homogeneous graph with V G Sh, that is, the vertices 
have been ordered according to the perfect vertex ehmination scheme for 
homogeneous graphs outhned in Section 2.4. Let D be the directed graph 
obtained from G by directing all edges in G from the vertex with the smallest 
number to the vertex with the highest number. Let pa(i) denote the set 
of parents of i according to the direction specified in D. It follows that 
pa(i) =M^{i). As in [9, 20], let pr(i) = {l,2,...,i - 1} denote the set of 
predecessors of i according to the direction specified in D. We now proceed 
to define the hyper-Markov property. 

Definition 1. A family of priors J- on Pq satisfies the strong hyper- 
Markov property with respect to the direction D if, whenever tt G and 

S ~ TT, 

Si|pa(i) -L Spr(i) VI < i < m, 
where S,|p,(,) := S,, - {^<^r^T.< = D^. 

In the following corollary, we state, without proof, that the family of 
priors vr^'^ satisfies the strong hyper-Markov property with respect to the 
direction D. 

Corollary 1. Let G = {V,E) he homogeneous with V e Sh- If ^ ^ 

Da _L S{i,2,...,j-i} VI < i < m. 
Remark. Recah that 

is different from 

^{l,2,...,i-l} = {{^uv))l<u,v<i-l- 

We demonstrated in Section 3.2 that the family IWq^ of Letac and Mas- 
sam [20] is a subfamily of our class of priors vr^'^ when G is homogeneous. 
Consequently, we can now prove hyper-Markov properties for the IWq^^ 
family. 

Corollary 2. Let G = {V,E) be homogeneous with V € Sh- Let D be 
the directed graph obtained from G by directing all edges in G from the vertex 
with the smallest number to the vertex with the highest number. The family 
IWq^ is then strong hyper-Markov with respect to the direction Gh- 
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Hyper-Markov properties for the IWq^ family were not established in 
[20]. Hence, we note that the corollary above is a new result for this family. 

We now proceed to state, without proof, the functional form of the nor- 
malizing constant for homogeneous graphs, once again exploiting the corre- 
spondence between our covariance priors and the conjugate priors for DAGs. 
In particular, below, we state necessary and sufficient conditions for exis- 
tence of the normalizing constant and give an explicit expression for it in 
such cases. 



Corollary 3. Let G = {V,E) be a homogeneous graph with vertices 
ordered such that V G Sh- Then, zg{U,cx) < oo if and only if a satisfies the 
conditions in Proposition 1, that is, Oj > |A/'^(i)| -|- 2 Vi = 1, 2, . . . , m. In this 
case, 



1=1 ^ ^ 



(5.5) 

^ |f^^i|a,/2-|Ar^(i)|/2-3/2 y | jy^i |a,/2-|A/'^ (i)|/2-l ^ 

We now proceed to state, without proof, expected values related to our 
class of priors vr^'^ when G is homogeneous, again by exploiting the cor- 
respondence between our covariance priors and conjugate priors for DAGs. 
In particular, we now provide a recursive method that gives closed form 
expressions for the expected value of the covariance matrix when 
Since J^uv = y{u,v) ^ E, we only need to evaluate the expectation of Sjj 
and for every 1 <i <m. Let 

Ai ■.= {ieV:Af^{i) = ^}. 

Clearly, if i£ Ai, then S^* and S.j are vacuous parameters and Da = Sjj. 
It follows from Theorem 3 that for i G Ai, 



assuming that ctj > 4, since X ~ IG(A,7) implies that E[X] = 
For A; = 2, 3, 4, ... , define 



Ak = \ieV:M^{i)^\jAi\\i\jA 



k-l ^ ^ /fc-l 



1=1 ) \ \Z=1 



Since there are finitely many vertices in V, there exists some k* such that 
Ak = (p ioi k > k* . The sets {Ak}i<k<k* essentially provide a way of com- 
puting E(/^Q,[E], by starting at the bottom of the Hasse diagram of G and 
then moving up sequentially. 
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Corollary 4. Let G be a homogeneous graph. Given the expectations 
of T,^ and Tijj for j G U/=i^^^ expectations of Tj^ and Sjj for i & Ak 
are given, respectively, by the expressions 



ai-\M^{i)\-4 
+ tr( Ef;,„P 



ai-\M'<{i)\-4: 
provided that Ui > \M'^{i) \ +4. 



The corollary is not formally proved since it follows directly from the 
correspondence between our covariance priors and the natural conjugate 
priors for DAGs. We note once more that the expressions above yield a 
recursive but closed form method to calculate E[S] when S ~ vr^'^. 

Remark. There is an intriguing parallel between the expressions for the 
normalizing constant and the expected values for the vr^'^ distribution and 
the IW Pq distribution (as derived in [20, 26]) when G is homogeneous. This 
automatically leads one to wonder if the vr^*^ and the IW distributions 
are the same. We now show that this is not the case. 

If one compares the density of (Djj, (S^*)~-^S;^)i<j<m in (5.4) and the 
IW*p^ density in (3.16) of [20], they initially appear to have the same 
functional form. We now proceed to show that they are supported on dif- 
ferent spaces. This difference is illustrated by the following example. Let 

13 2 

G =• — • — •. In this case, each equivalence class in the Hasse tree of 
G has exactly one vertex. Note that vertex 3 has two descendants, and 
vertices 1 and 2 do not have any descendants in the Hasse tree of G. 
Hence, it follows that the density of (Djj, (S"^*)~-'^S^)i<j<3 is supported on 
(M X NULL,R X NULL,R x M^). On the other hand, vertices 1 and 2 have 
one ancestor, and vertex 3 has no ancestors in the Hasse tree of G. Hence, 
it follows that the IW*p^ density is supported on (M x M,M x R,R x NULL). 

So, at first glance, it looks as if {Da, (S"^*)~-'^S;^)i<i<m has the same form 
as IWp^, but, upon further examination, we see that even for the simplest 
homogeneous graph, they are structurally different. In fact, (M x NULL,M x 
NULL,Rx M^) does not support the IW*p^ distribut ion for any G that is 
homogeneous. 
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Table 1 

Empirical covariance matrix for yeast data 



GALll GAL4 GAL80 GALS GAL7 GALIO GALl GAL2 

GALll 0.152 



GAL4 


0.034 


0.130 












GAL80 


0.015 


0.039 


0.221 










GALS 


-0.055 


0.034 


0.073 


0.608 








GAL7 


-0.051 


-0.053 


0.183 


0.722 


3.423 






GALIO 


-0.048 


-0.039 


-0.188 


0.553 


2.503 


2.372 




GALl 


-0.066 


-0.061 


0.224 


0.517 


2.768 


2.409 


2.890 


GAL2 


-0.119 


-0.018 


0.208 


0.583 


2.547 


2.278 


2.514 



6. Examples. The main purpose of this paper is to undertake a theo- 
retical investigation of our class of distributions and their efficacy for use 
in Bayesian estimation in covariance graph models. We nevertheless pro- 
vide two examples (one real and one simulated) to demonstrate how the 
methodology developed in this paper can be implemented. 

6.1. Genomics example. We provide an illustration of our methods on 
a data set consisting of gene expression data from microarray experiments 
with yeast strands from Gasch et al. [13]. This data set has also been ana- 
lyzed in [4, 11]. As in [4, 11], we consider a subset of eight genes involved 
in galactose utilization. There are n = 134 experiments and the empirical 
covariance matrix for these measurements is provided in Table 1. Note that 
the sample covariance matrix is obtained after centering since the mean is 
not assumed to be zero. 

We consider the covariance graph model specified by the graph G in Fig- 
ure 3 with the overall aim of estimating S under this covariance graph model. 
The maximum likelihood estimate for S G Pq , provided by the iterative con- 
ditional fitting algorithm described in [4], yields a deviance of 4.694 over 7 
degrees of freedom, thus indicating a good model fit. The maximum likeli- 
hood estimate is provided in Table 1. We use the following ordering for our 
analysis: {GALll, GAL4, GAL80, GAL3, GAL7, GALIO, GALl, GAL2}. 

Our goal is to obtain the posterior mean for S under our new class of pri- 
ors and then to provide Bayes estimators for E. We use two diffuse priors to 
illustrate our methodology. The first prior is denoted as vr^/j where Ui = 
ilM/g, Q,i = 5 + 17^^ (i) I , i = 1, 2, . . . , 8, that is, Qi = (5, 6, 6, 8, 7, 8, 9, 12) . The 
second prior used is t^u^^o.'^ , where U2 = 0, = 2, i = 1,2, . . . ,8. Note that we 
could have used any ordering in Sd for our analysis. As an example, we select 
an alternate ordering, {GALll, GAL4, GAL80, GALIO, GAL2, GALS, GALl, 
GAL7}, and also consider the two priors mentioned above under this alter- 
native ordering. The block Gibbs sampling procedure was run for the four 
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priors as specified in Section 4. The burn-in period was chosen to be 1,000 
iterations and the subsequent 1,000 iterations were used to compute the 
posterior mean. Increasing the burn-in period to more than 1,000 iterations 
results in insignificant changes to our estimates, thus indicating that the 
burn-in period chosen is sufficient. The posterior mean estimates for both 
the priors, together with the MLE estimates, are provided in Table 2. The 
running time for the Gibbs sampling procedure for each prior is approxi- 
mately 26 seconds on a Pentium M 1.6 GHz processor. We find that the 
Bayesian approach using our priors and the corresponding block Gibbs sam- 
pler gives stable estimates and thus yields a useful alternative methodology 
for inference in covariance graph models. We also note that the two different 
vertex orderings yield very similar results. 

6.2. Simulation example. A proof of convergence of the block Gibbs sam- 
pling algorithm proposed in Section 4.1 was provided in Section 4.2. The 
speed at which convergence occurs is also a very important concern for 
implementation of the algorithm. The number of steps that are required be- 
fore one can generate a reasonable approximate sample from the posterior 
distribution is reflective of the rate of convergence. Understanding this is 
important for the accuracy of Bayes estimates such as the posterior mean. 
We proceed to investigate the performance of the block Gibbs sampling al- 
gorithm in a situation where the posterior mean is known exactly and hence 




Fig. 3. Covanance graph for yeast data. 
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Table 2 

ICF: Maximum likelihood estimate from iterative conditional fitting. BYl; Bayesian 
posterior mean estimate for prior tt^j^^i . BY2; Bayesian posterior mean estimate for 
prior i^u2,a.^ ■ BYl; Bayesian posterior mean estimate for prior ttj/^ ,^1 with a different 
ordering. BY2; Bayesian posterior mean estimate for prior tt^^ ,^2 with a different 

ordering 





GALll 


GAL4 


GAL80 


GALS 


GAL7 


GALIO 


GALl 


GAL2 


Method 


GAL 11 


0.152 


0.030 





—0.052 











—0.068 


ICF 




0.164 


0.030 





-0.050 











-0.068 


BYl 




0.156 


0.030 





-0.052 











-0.068 


BY2 




0.152 


0.030 





-0.051 











-0.069 


BYl 




0.155 


0.030 





-0.052 











-0.070 


BY2 


GAL4 




0.128 


0.040 


0.042 











0.030 


ICF 






0.142 


0.040 


0.041 











0.027 


BYl 






0.133 


0.041 


0.042 











0.028 


BY2 






0.128 


0.039 


0.040 











0.026 


BYl 






0.132 


0.040 


0.042 











0.0278 


BY2 


GAL80 






0.223 


0.082 


0.197 


0.198 


0.239 


0.227 


ICF 








0.237 


0.072 


0.193 


0.194 


0.235 


0.216 


BYl 








0.232 


0.076 


0.199 


0.2 


0.243 


0.223 


BY2 








0.224 


0.076 


0.197 


0.197 


0.240 


0.218 


BYl 








0.232 


0.076 


0.202 


0.203 


0.245 


0.227 


BY2 


GALS 








0.612 


0.723 


0.549 


0.515 


0.582 


ICF 










0.626 


0.713 


0.544 


0.509 


0.575 


BYl 










0.643 


0.747 


0.568 


0.532 


0.599 


BY2 










0.628 


0.719 


0.549 


0.517 


0.582 


BYl 










0.667 


0.749 


0.574 


0.531 


0.605 


BY2 


GAL7 










3.422 


2.593 


2.768 


2.540 


ICF 












3.462 


2.584 


2.756 


2.533 


BYl 












3.588 


2.682 


2.866 


2.636 


BY2 












3.541 


2.588 


2.761 


2.532 


BYl 












3.708 


2.681 


2.865 


2.627 


BY2 


GALIO 












2.372 


2.409 


2.267 


ICF 














2.373 


2.400 


2.266 


BYl 














2.453 


2.497 


2.358 


BY2 














2.389 


2.407 


2.277 


BYl 














2.473 


2.489 


2.356 


BY2 


GALl 














2.890 


2.502 


ICF 
















2.961 


2.501 


BYl 
















3.086 


2.604 


BY2 
















2.969 


2.496 


BYl 
















3.087 


2.582 


BY2 


GAL2 
















2.870 


ICF 


















3.003 


BYl 


















3.153 


BY2 


















2.892 


BYl 


















3.005 


BY2 



WISHART DISTRIBUTIONS 



37 




Fig. 4. Hasse diagram for a homogeneous graph with 50 vertices. 

allows a direct comparison. Consider a homogeneous graph G with 50 ver- 
tices, with the corresponding Hasse tree given by Figure 4. Let S G Pq, where 
the vertices have been ordered according to the Hasse perfect vertex elimina- 
tion scheme of Section 2.4, the diagonal entries are 50 and all other nonzero 
entries are 1. We simulate 100 observation vectors Yi, Y2, Y3, . . . , Y99, Yioo 
from ^50(0, S). For illustration purposes, we choose a diffuse prior vr^^ with 
[7 = and = 2|7V<(i)| 5,i = 1,2, . . . ,50. 

Since the graph G is homogeneous, we can compute the posterior mean 
Smean '■='^u,a\^ \ Yi , Y2 , . . . , Yiqo] explicitly. We can therefore assess the 
ability of the block Gibbs sampling algorithm to estimate the posterior mean 
by comparing it to the true value of the mean. We run the block Gibbs sam- 
pling algorithm to sample from the posterior distribution and subsequently 
check its performance in estimating Smcan- We use an initial burn- in pe- 
riod of B iterations and then average over the next / iterations to get the 
estimate S. The times needed for computation (using the R software) and 

the relative errors H^~^"^°'^"ll2 corresponding to various choices of B and / 

II mean 1 1 2 

are provided in Table 3. The diagnostics in Table 3 indicate that the block 
Gibbs sampling algorithm performs exceptionally well, yielding estimates 
that approach the true mean in only a few thousand steps. The time taken 
for running the algorithm is also provided in Table 3. 

The diagnostics in Table 3 indicate that the block Gibbs sampling al- 
gorithm performs exceptionally well, yielding estimates that approach the 
true mean in only a few thousand steps. The time taken for running the 
algorithm is also provided in Table 3. 

7. Closing remarks. In this paper, we have proposed a theoretical frame- 
work for Bayesian inference in covariance graph models. The main challenge 
was the unexplored terrain of working with curved exponential families in 
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the continuous setting. A rich class of conjugate priors has been developed 
in this paper for covariance graph models where the underlying graph is 
decomposable. 

We have been able to exploit the structure of the conjugate priors to 
develop a block Gibbs sampler to effectively sample from the posterior dis- 
tribution. A rigorous proof of convergence is also given. Comparison with 
other classes of priors is also undertaken. We are able to compute the nor- 
malizing constant for homogeneous graphs, thereby making Bayesian model 
selection possible in a tractable way for this class of models. The Bayesian 
approach yields additional dividends, in the sense that we can now carry 
out inference in covariance graph models, even when the sample size n is 
less than the dimension p of the data, something which is otherwise not 
generally possible in the maximum likelihood framework. Furthermore, we 
thoroughly explore the theoretical properties of our class of conjugate pri- 
ors. In particular, in the homogeneous case, hyper-Markov properties and 
closed form expressions for the expected value of the covariance matrix are 
established. Furthermore, the usefulness of the methodology that is devel- 
oped is illustrated through examples. A couple of open problems are worth 
mentioning: 

• What are the necessary conditions for the existence of the normalizing 
constant for decomposable graphs? 

• Does the hyper-Markov property for the class of priors developed in this 
paper hold for decomposable graphs? 

We conclude by noting that the use of the class of Wishart distributions 
introduced in this paper for Bayesian inference, along with a detailed study 
of Bayes estimators in this context, is clearly an important topic and is the 
focus of current research. 







Table 3 




Performance 


assessment 


of the Gibbs sampler 


in simulation 






example 




Burn-in (B) 


Average (I) 


Time (seconds) 


Relative error 


1000 


1000 


139.77 


0.01748220 


2000 


1000 


209.72 


0.01240595 


3000 


1000 


279.52 


0.01300910 


4000 


1000 


349.44 


0.01142864 


4000 


3000 


489.19 


0.01246141 


4000 


5000 


631.21 


0.01081264 


4000 


7000 


769.70 


0.009244206 
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APPENDIX 

Proof of Proposition 1. Prom the definition of N and L, it is easy 
to verify that 

{LN)ii = 1 VI < i < m, 
{LN)ij = VI < i < J < m. 
Now, let i> j. It foHows, by the definition of N, that 

i 

{LN)ij = UkNkj 

k=j 

i-l 

k=j + l T&A,Tl=k,T^i,^^(^^-f=j 

i-l 

Note that any t' £ A with t[ = i, t^jj^j^j-^,-) = j, dim(T') > 2 can be uniquely 
expressed as r' = (i, r), where i + 1 < ti < i — 1, Tdim(T-) = j- Recall that, by 
definition, Lt-i = Lir^L-r- Also, if r' G ^ with t[ = i, T^juj)-^/) = j, dim(T') = 
2, then r' = and Lt' = Lij. Hence, 

= Nij - Nij 
= 0. 

Hence, LN = I and thus = N. □ 

Proof of Theorem 1. Let us simplify the integral by integrating out 
the terms Da, 1 <i <m: 

-(tr((LDLT)-iC/)+E™li a. logD.,)/2 ^jy 



n 



1 
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^ r(a,/2-l)2"-/2-^ 

t\ {{L-w{LT)-^ur^/^-^ 

(assuming Qj > 2 Vi = 1, 2, . . . , m) 

n r(a./2 - 1)2"-/^-^ 
ii((L-)..f/((L-i)..n-/2-'^- ^**^ 

In order to simplify this integral, we perform a change of measure by trans- 
forming the nonzero elements of L to the corresponding elements of L~^. 
For convenience and brevity, the notation L^-^ is used in place of {L~^)ij. 
Now, note the following facts. 

1. Let L G Cg- From Proposition 1, for £ E, i> j, 

(A.l) = —Lij + f{{Luv)(u,v)£E,j<u<i,j<v<u or u=i,j<v<i)i 

that is, L^j^ + Lij is a function (/) of Luv, iu,v) £E,j<u<i,j<v< 
u or u = i, j < V < i, such that / is zero when all its arguments are zero. 
We use the above to show that L is a function of {-^^^t^}u>i,,{M,ii)g_E- Let 
i* = min{i : Lij ^ for some j <i}. Let j* = max{j : Li*j ^ 0}. By (A.l) 
and the definition of i* and j* , we have Li*j* = — L^^^*. We proceed by 
induction. Let i > j, G E and suppose that the hypothesis is true 
for all {u,v) £ E,l < u < i, l<v<uoru = i, j<v<i. Then, 



Lij — L- + f{{Luv){u,v)GE,j<u<i,j<v<u or u=i,j<v<i) 

and, by the induction hypothesis, the right-hand side of the above equa- 
tion is a function of {L~^}u>v,{u,v)£E ■ Hence, the matrix L is a function 

of {^uv}u>v,{u,v)eE- 

It follows that the transformation 

{^ij}ii,j)&E,i>j {^i/}ii,j)&E,i>j 

is a bijection and the absolute value of the Jacobian of this transformation 
is 1 since it is the determinant of a lower-triangular matrix with diagonal 
entries 1. 

/C/ll C/l2 



2. If x= (^^^ j and U = yjj^^^ u^^j is a positive definite matrix, then 

(A.2) X^[/X = z'^Z + ^1{U22 - f/2lf/r/f/l2)x2 > X^([/22 " f/2lf/r/C/l2)x2, 

where z = C/|/^xi + U'^^^'^Ui2X.2- 

Hence, after transforming the nonzero entries of L to the corresponding 
entries of and using (A.2) to eliminate the dependent entries of 
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from the integrand, we get 

g-(tr((LDL^)-iC/)+E™ 1 log Ai)/2 ^jj 



r(a,/2-l)2"-/2-i 
t\ ((L-i)..C/((L-i),.)^)--/2-i 



<kT\ —da^. 



i=2' 



(')l((af 1)C/*(10) 



1 



Here, K \s a, constant, U* is an appropriate positive definite matrix and 
represents the independent entries in the ith row of . By a suitable hnear 
transformation bj of each of the aj, i = 2, 3, . . . , m, we get 

g-(tr{(LZ)LT)-iC/)+E™=i a. logA,)/2 ^jj 
m ^ 

Here, i^T* and n**, i = 2,3, . . . ,m, are constants. Using the standard fact that 

■ ax < oo II 7 > — , 



(x^x + l)7 " ' ^ 2'' 

we conclude that 

g-(tr((LDL^)-f/)+Er=i".logD.,)/2^^^^^^ 



if ai > |7V^ (i) I + 2 for alH = 1, 2, . . . , m. □ 
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